vignettes/tutorial_associationsubgraph.Rmd
tutorial_associationsubgraph.Rmdassociationsubgraphs package, and to conducting the complete analysis including all the steps from the structure of the input data to the final visualization using an example data set.
devtools::install_github("tbilab/associationsubgraphs")
library(tidyverse)
library(associationsubgraphs)We’ll use Phecode data available in the associationsubgraphs package as an example.
The format of the input data set is similar to the example comorbidity data set, which is a dataframe including columns a and b representing the ids of the variables or nodes, and column strength that is a numeric indicator of strength of association (higher = stronger).
In this example, node pairs refer to Phecode Pairs where strength is extracted from association study and represents how strongly two nodes are associated with each other. It includes all combinations of pair-wise Phecodes, which will be used for the analysis.
Please remove node pairs with NA missing values of strength in the data set.
data("association_pairs")
association_pairs = association_pairs %>%
arrange(desc(strength)) %>%
filter(!is.na(strength))
dim(association_pairs)## [1] 1462623 3
| a | b | strength |
|---|---|---|
| 296.00 | 300.00 | 131.0360 |
| 381.20 | 389.00 | 129.1788 |
| 173.00 | 216.00 | 127.6532 |
| 173.00 | 702.00 | 125.0173 |
| 636.00 | 655.00 | 122.8753 |
| 381.00 | 389.00 | 122.6985 |
594.10: Calculus of kidney with other Phecodes. To explore and visualize the large-scale associations even if you have a particular interest in a node in your network, using the network that includes all combinations of pair-wise Phecodes rather than the network only includes the connection of the interested nodes with other nodes.| a | b | strength |
|---|---|---|
| 594.10 | 594.30 | 73.60715 |
| 594.10 | 595.00 | 67.11950 |
| 594.10 | 785.00 | 42.85690 |
| 594.10 | 594.80 | 37.15580 |
| 594.10 | 594.20 | 26.98550 |
| 594.10 | 751.00 | 24.45370 |
data("phecode_def")
head(phecode_def) %>%
dplyr::select(phecode,description,group) %>%
knitr::kable()| phecode | description | group |
|---|---|---|
| 008.00 | Intestinal infection | infectious diseases |
| 008.50 | Bacterial enteritis | infectious diseases |
| 008.51 | Intestinal e.coli | infectious diseases |
| 008.52 | Intestinal infection due to C. difficile | infectious diseases |
| 008.60 | Viral Enteritis | infectious diseases |
| 008.70 | Intestinal infection due to protozoa | infectious diseases |
calculate_subgraph_structure() to calculate the set of subgraphs at every threshold, and the associations were sorted in descending order of strength. The nodes connected by the highest association strength are set as a cluster. The second-highest association strength is added. Specifically, if at least one node in the nodes pair with the second-highest association is shared with the first cluster that built based on the highest association, the nodes pair with the second-highest association will be added to the existing cluster including the non-shared node. Otherwise, if both nodes with the second-highest association are not shared with the first cluster, then a new separate cluster is created. This procedure is repeated for all association pairs.
subgraphs <- association_pairs %>%
calculate_subgraph_structure()
subgraphs %>%
dplyr::select(-subgraphs) %>%
head() %>%
knitr::kable()| step | n_edges | strength | n_nodes_seen | n_subgraphs | max_size | rel_max_size | avg_size | avg_density | n_triples |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 131.0360 | 2 | 1 | 2 | 1.0000000 | 2.000000 | 1.0000000 | 0 |
| 2 | 2 | 129.1788 | 4 | 2 | 2 | 0.5000000 | 2.000000 | 1.0000000 | 0 |
| 3 | 3 | 127.6532 | 6 | 3 | 2 | 0.3333333 | 2.000000 | 1.0000000 | 0 |
| 4 | 4 | 125.0173 | 7 | 3 | 3 | 0.4285714 | 2.333333 | 0.8888889 | 1 |
| 5 | 5 | 122.8753 | 9 | 4 | 3 | 0.3333333 | 2.250000 | 0.9166667 | 1 |
| 6 | 6 | 122.6985 | 10 | 4 | 3 | 0.3000000 | 2.500000 | 0.8333333 | 2 |
Values in column a and b of association_pairs will be shown in the visualization, converting Phecode to Phecode description for better visualization.
Preparing a dataframe that has a column id that corresponds to the variables coded in a and b of association_pairs that contains additional info of the nodes. For example, color and Phecode category were added to each Phecode. And the added information will be shown in the table after clicking a subgraph to see details.
#convert Phecode to Phecode description
association_pairs = association_pairs %>%
rename(phecode=a) %>%
left_join(.,phecode_def[,c("phecode","description")],by="phecode") %>%
rename(a=description) %>%
dplyr::select(-phecode) %>%
rename(phecode=b) %>%
left_join(.,phecode_def[,c("phecode","description")],by="phecode") %>%
rename(b=description) %>%
dplyr::select(-phecode)
#extract node id
node_info <- c(association_pairs$a,association_pairs$b) %>%
unique() %>%
as_tibble() %>%
rename(id = value) %>%
left_join(.,phecode_def %>% dplyr::select(phecode,description,group,color) %>% dplyr::rename(id=description),by="id") %>%
arrange(group)
head(node_info) %>%
knitr::kable()| id | phecode | group | color |
|---|---|---|---|
| Hypertensive chronic kidney disease | 401.22 | circulatory system | #D14285 |
| Rheumatic disease of the heart valves | 394.00 | circulatory system | #D14285 |
| Coronary atherosclerosis | 411.40 | circulatory system | #D14285 |
| Cardiomyopathy | 425.00 | circulatory system | #D14285 |
| Cardiac pacemaker in situ | 426.91 | circulatory system | #D14285 |
| Primary/intrinsic cardiomyopathies | 425.10 | circulatory system | #D14285 |
Visualizing the subgraph search
The subgraphs includes the rel_max_size: size of the largest subgraph relative to the combined size of all other subgraphs, n_subgraphs: the number of subgraphs and n_triples: the number of subgraphs with at least three members. We choose the minimum rel_max_size, maximum n_subgraphs and n_triples to take a look.
Using “largest-smallest” rule for finding the optimal threshold by tracking the minimum size of the largest subgraph relative to the combined size of all other subgraphs.
min_rel <- subgraphs %>%
filter(rel_max_size == min(rel_max_size)) %>%
tail(1)
max_num_subgraphs <- subgraphs %>%
filter(n_subgraphs == max(n_subgraphs)) %>%
tail(1)
max_num_triples <- subgraphs %>%
filter(n_triples == max(n_triples)) %>%
tail(1)
subgraphs %>%
# filter(rel_max_size < 0.5) %>%
dplyr::select(
strength,
n_subgraphs,
max_size,
rel_max_size,
avg_density,
n_triples,
step
) %>%
pivot_longer(-step) %>%
filter(step <= 5000) %>%
ggplot(aes(x = step, y = value)) +
geom_step() +
geom_vline(xintercept = min_rel$step, color = 'orangered') +
geom_vline(xintercept = max_num_subgraphs$step, color = 'forestgreen') +
geom_vline(xintercept = max_num_triples$step, color = 'steelblue') +
facet_grid(rows = vars(name), scales = "free_y")
#visualize
visualize_subgraph_structure(
association_pairs,
node_info = node_info,
subgraph_results = subgraphs,
trim_subgraph_results = TRUE
)Calculus of kidney, simply supply the id of "Calculus of kidney" to the visualize_subgraph_structure() function and you will be automatically taken to where Calculus of kidney first gets grouped into a subgraph.
#visualize
visualize_subgraph_structure(
association_pairs,
node_info = node_info,
subgraph_results = subgraphs,
trim_subgraph_results = TRUE,
pinned_node = "Calculus of kidney"
)